R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

#Structural Equations Modeling (SEM) Example
#BIO 202 - Ecological Statistics 
#Stanford University 
#Fall 2022
#Instructors: Dr. Tad Fukami and Dr. Jesse Miller

#Project Background: 
##Research Questions - 
#1.Does Pollution impact Asthma?
#2.Does Pollution and Poverty impact Asthma?

#Location: The entire state of California 

#Why is this study important? This study identifies metrics and techniques 
#that are useful to quantify the risk and impacts of Pollution on 
#vulnerable populations in California. Moreover, this study allows 
#us to evaluate the causation of risks and impacts to help 
#better inform decision-making to guide 
#policy and investment in under-served and underrepresented communities. 

#Data sample size (n = 8,035 observations)
#8,035 datapoints represent geogrpahic boundaries within California (area).
#Each census tract has approx. 4,200 people on average. 

#Pro-Tips: Make sure to select the PDF format when creating a 
#new RMarkdown file and install tinytex package to Knit file to PDF. 
#Install tinytex package to Knit RMarkdown File to PDF. 
#install.packages('tinytex')
#install.packages("tinytex", repos = "http://cran.us.r-project.org")
#tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex()
#Reference: https://yihui.org/tinytex/

#SEM Primary References: 
#1. Grace, James B. 2006. Structural Equation Modeling 
#and Natural Systems, Cambridge University Press
#2. Lefcheck, J. 2019. Structural Equation Modeling in R for Ecology and Evolution.


#Why Do SEM?
#SEM tests the direct and indirect effects on pre-assumed causal relationships:
#Enlisted by ecologists because
#1. Theory oriented
#2. Can analyze complex causal networks
#3. Permits causal inferences
#4. Model flexibility
#5. More intuitive interpretation

#Several conditions are deemed necessary, but not sufficient to establish 
#causation. 
#1. There is an empirical association between the variables - they 
#are significantly correlated. 
#2. A common cause of the two variables has been 
#ruled out, and the two variables have a theoretical connection. 
#3. One variable precedes the other, and if the preceding variable changes, the 
#outcome variable also changes (and not vice versa). 
#4. Even when some of the conditions of causation are not fully met, 
#causal inference may still be justifiable

#Why did SEM work for us and our data? 
#1. We have pre-assumed causal relationships between variables.
#2. Answering our research questions requires causal analysis.

#Why was SEM an appropriate choice for us?

#1. We want to understand the causal relationship between 
#specific variables and we have an a priori hypothesis about this 
#relationship. Furthermore, the data we have is suitable for causal analysis and SEM.

#2. Specifically, we are interested in the relationship between 
#asthma, pollution, and poverty.

#3. Asthma was measured as the age-adjusted 
#rate of emergency department visits for asthma.

#4. Pollution was measured as the average of percentiles from the Pollution Burden 
#indicators, which is an aggregation of different pollution sources from 
#buildings, transportation, and the natural environment. 

#5. Poverty was measured as the percent of population living below two times 
#the federal poverty level.

#6. All of our observed variables were measured as continuous data.

#Data structure is suitable for SEM

#Variables and Descriptions 
#Variable: Asthma: 
#Description: Age-adjusted rate of emergency department visits for asthma
#Data Source: California Office of Health Hazard Assessment Pollution 
#Data Type Continuous

#Variable: Pollution
##Description: Average of percentiles from the Pollution Burden indicators 
#(with a half weighting for the Environmental Effects indicators)
#Data Source: California Office of Health Hazard Assessment Pollution 
#Data Type Continuous


#Variable: Poverty
#Description: Percent of population living below two times the federal 
#(nationwide) poverty level of $18,755 annually
#Data Source: California Office of Health Hazard Assessment
#Data Type: Continuous

#Variable: EH_2022 = Extreme Heat 
#Description: Number of extreme heat days in the year 2022
#Data Source: Cal-Adapt
#Data Type: Continuous

#Fundamentals of SEM 
#The independent (x) variables are referred to as exogenous, while the 
#dependent (y) variables are called endogenous.
#Endogenous variables can have influences on other endogenous variables. 
#Commonly, causal order will flow from left to right across a model, 
#although this will not always be the case.The ways in which variables in a 
#model are connected are important.

#SEM is essentially a linear model framework that models regression equations 
#with latent variables, and therefore it is possible to model the 
#relationship between our variables of interest. To create our model in R, 
#we chose to use the lavaan package, which stands for Latent variable analysis. 
#This package is the most commonly used package for SEM, and in our experience, 
#offers the most streamlined and intuitive process for creating your model and 
#running goodness-of-fit tests. While we’re not planning on getting into 
#latent variables in this presentation, it is important to know that they are 
#basically unobserved variables that we can construct from our observed data. 
#We didn’t use latent variables in our model, but it is possible to include 
#them using lavaan.


#lavaan (LAtent VAriable ANalysis)
#Used for multivariate statistical modeling 
#(i.e path analysis, confirmatory factor analysis, structural equation modeling)
#Streamlined formulas for SEM compared to other packages
#Creates a model and tests Goodness-of-fit

#2 types of variables: 
#1. Observed variable (exists within the dataset) 
#2. Latent Variable (Unobserved data created from existing variables

#The SEM process is composed of the following five steps and decisions:
#1. Construct a path diagram that shows the measurement and 
#structural model of interest.
#2.Identify the level of measurement for each item and 
#check distributional assumptions.
#3. Ensure that the fitting function you chose is based on 
#measurement types (e.g., maximum likelihood for 
#continuous measures, weighted least square for ordinal measures).
#4. Move through the model testing process in a logical fashion.
#5. Fit the model using the appropriate fitting function and carefully 
#assess model fit by using a set of indexes.

#Checklist
#Part 1. Hypothesis  
#1.1 Develop a hypothesis to test. For this study we used 
#Research Questions
#1.Does Pollution impact Asthma?
#2.Does Pollution and Poverty impact Asthma?
#1.2 Establish paths of causal correlation between variables and 
#the meaning for using these variables 
#1.3 Select at least 3 variables that you believe have 
#causal correlation to establish paths for influencing outcome 
#1.4 Load in data
#1.5 Identify your observed or latent variables (unobserved or lagging)
#1.6 Data review to check for absent data, No NA's, no missing data points

#Part 2. Null/Baseline Model and User Model Evaluation
#2.1 There may be other ways to analyze categorical or ordinary data but we 
#focused on using continuous 
#2.2 This is how we applied SEM for our dataset composed of continuous data
#2.3 Model specification and create measurement models
#2.4 Choose the function to apply to SEM analysis such as linear regression
#2.5 Read in your data into a new model SEM_M1
#2.6 Make a new dataframe for SEM_M1
#2.7 Layering SEM_M2 and SEM_M3 over df1 similar to a filter 
#2.8 What is the reason to include 1 in the models 
#(Why would we want to change the intercept (mean)?)
#2.9 Paths of causal correlation are being tested
#2.10 Make sure there is no covariance testing inside measurement models 
#(repetitive and throws off the model analysis) ()

#Part 3. Path Analysis 
#3.1 Verify that your apriori assumption was correct when you first 
#constructed a path diagram prior to modeling. 
#3.2 Look at the direct and/or indirect effects of causal correlation.
#3.3 Graph global p-value and R-Squared to assess quality of hypothesis
#and model tests

#install.package(lavaan) --- Latent Variable Analysis 
#(however we did not use latent variables)
#only used observed variables for this study 
#(main package to perform SEM analysis)
#install.package(sem) -- may interfere with lavaan package 
#install.packages("semPaths") -- This function creates a path diagram 
#of a SEM model (or general linear model), which is then plotted using qgraph
#install.packages("semptools") -- Provides tools for 
#structural equation modeling, many of which extend the 'lavaan' pack- age; 
#for example, to pool results from multiple imputations, 
#probe latent interactions, or test measurement invariance.

#set working directory for file reference locations inside from my computer
setwd("~/Documents/GitHub/devanaddisonturner.github.io/BIO 202_Ecological Statistics/SEM Models")
#loads in Lavaan package
library(lavaan)
## This is lavaan 0.6-12
## lavaan is FREE software! Please report any bugs.
#imports SEM_Example_Data csv. file from my working directory as SEM Model 1
SEM_M1 <- read.csv("SEM_Example_Data_California.csv")
#Make a new dataframe for SEM_M1
df1<-data.frame(SEM_M1) 
df2<-na.omit(df1)

#Hypothesis Testing
#3 variables
#2 models using linear regression 
#Continuous data

#Data Preparation 
#SEM in lavaan is sensitive to NAs and continuous data
#Before setting up model, omit rows with missing data

#Syntax structure reference using Lavaan
#https://lavaan.ugent.be/tutorial/syntax1.html


#How do we include Extreme Heat in both M2 and M3 for future researcn?

##Baseline Model
  SEM_M2 <-   ' 
  # regressions
    Poverty ~ Pollution
  # Poverty ~ Pollution*EH_2022
  # Pollution ~ Poverty 
  # Asthma ~ Pollution
    Asthma ~ Poverty
'
# Poverty ~~ Pollution
#SEM_M2_LM <- lm(Asthma ~ Poverty, data = SEM_M1) 
SEM_M2_LM1 <- lm(Asthma ~ Pollution, data = SEM_M1)
#Pollution Burden is correlated with the 
#rate of Asthma in California 
#The rate of Asthma in California increases with Pollution Burden
summary(SEM_M2_LM1)
## 
## Call:
## lm(formula = Asthma ~ Pollution, data = SEM_M1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -55.12 -20.97  -6.58  13.26 193.13 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.94979    1.16723   26.52   <2e-16 ***
## Pollution    0.49292    0.02621   18.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.92 on 8022 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.04221,    Adjusted R-squared:  0.04209 
## F-statistic: 353.6 on 1 and 8022 DF,  p-value: < 2.2e-16
library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
plot(lm(Asthma ~ Pollution, data = SEM_M1))

plot(allEffects(SEM_M2_LM1)) 

#call in sem function to be performed into new model that will be fitted 
#from SEM analysis from SEM_M2
fit1b <- sem(SEM_M2, data=df2)   #Baseline Model
#fit the model then view output
summary(fit1b)
## lavaan 0.6-12 ended normally after 1 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         4
## 
##   Number of observations                          7959
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 5.113
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.024
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Poverty ~                                           
##     Pollution         0.538    0.015   36.006    0.000
##   Asthma ~                                            
##     Poverty           0.834    0.016   51.258    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Poverty         287.074    4.551   63.083    0.000
##    .Asthma          702.587   11.137   63.083    0.000
#Syntax Structure
#Reference: https://benwhalley.github.io/just-enough-r/path-models.html
#  m ~ x
# y ~ x + m

#User Model
SEM_M3 <-   '  
  # regressions
  # Pollution ~ Poverty 
    Poverty ~ Pollution
  # Poverty ~ Pollution*EH_2022
    Asthma ~ Poverty + Pollution
  # Asthma ~ Poverty + (Pollution*EH_2022)
  # Asthma ~ Pollution + Poverty
  # EH_2022 ~ Asthma + Pollution
  # Asthma ~ EH_2022
  # EH_2022 ~ Poverty
'

#Example of Syntax Structure
#f3.syn <- '
#Love ~ Money
#Happiness ~ Money + Love
#Travel ~ Happiness + Money
#f3 <- sem(f3.syn, d, conditional.x=FALSE)
#semPaths_default(f3)

SEM_M2_LM2 <- lm(Asthma ~ Poverty + Pollution, data = SEM_M1)
summary(SEM_M2_LM2)
## 
## Call:
## lm(formula = Asthma ~ Poverty + Pollution, data = SEM_M1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.245 -16.247  -5.176  10.478 186.695 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.96577    1.05053  22.813   <2e-16 ***
## Poverty      0.81874    0.01753  46.712   <2e-16 ***
## Pollution    0.05691    0.02519   2.259   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.5 on 7958 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.2487, Adjusted R-squared:  0.2485 
## F-statistic:  1317 on 2 and 7958 DF,  p-value: < 2.2e-16
#Null/Baseline User Model SEM Tests Output
plot(lm(Asthma ~ Poverty + Pollution, data = SEM_M1))

plot(allEffects(SEM_M2_LM2))

#Poverty and Pollution Burden are both correlated with the 
#rate of Asthma in California
#Poverty has the strongest correlation with the rate of Asthma in California
#The rate of Asthma in California increases with Poverty (nationwide) 
#and Pollution Burden
SEM_M1 %>% select(Pollution,  Poverty)  %>% pairs(cex = 0.1)

SEM_M1 %>% select(Pollution,  EH_2022)  %>% pairs(cex = 0.1)

SEM_M1 %>% select(Poverty,  EH_2022)  %>% pairs(cex = 0.1)

#There is a linear relationship between Pollution Burden and Poverty 
#in California
#Pollution Burden and Poverty are correlated with each other
#Poverty is the dominant variable
#Pollution Burden increases with Poverty(low-income communities in California) 
SEM_M2_LM3 <- lm(Asthma ~ EH_2022, data = SEM_M1)
summary(SEM_M2_LM3)
## 
## Call:
## lm(formula = Asthma ~ EH_2022, data = SEM_M1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.368 -21.549  -6.657  13.899 189.904 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.9838     0.9651   43.50   <2e-16 ***
## EH_2022       0.4958     0.0448   11.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.35 on 8020 degrees of freedom
##   (13 observations deleted due to missingness)
## Multiple R-squared:  0.01504,    Adjusted R-squared:  0.01492 
## F-statistic: 122.5 on 1 and 8020 DF,  p-value: < 2.2e-16
#User Model SEM Tests Output
plot(Asthma ~ EH_2022, data = SEM_M1)

plot(allEffects(SEM_M2_LM3))

#Extreme Heat is correlated with the rate of Asthma in California 
#The rate of Asthma increases with Extreme Heat in California
SEM_M2_LM4 <- lm(EH_2022 ~ Pollution, data = SEM_M1)
summary(SEM_M2_LM4)
## 
## Call:
## lm(formula = EH_2022 ~ Pollution, data = SEM_M1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.451  -5.151  -1.025   5.455  26.799 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.744401   0.294721  70.386   <2e-16 ***
## Pollution   -0.013609   0.006615  -2.057   0.0397 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.564 on 8031 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.0005268,  Adjusted R-squared:  0.0004023 
## F-statistic: 4.233 on 1 and 8031 DF,  p-value: 0.03969
plot(lm(EH_2022 ~ Pollution, data = SEM_M1))

plot(allEffects(SEM_M2_LM4))

#Pollution and Extreme Heat are not correlated
SEM_M2_LM5 <- lm(Poverty ~ EH_2022, data = SEM_M1)
summary(SEM_M2_LM5)
## 
## Call:
## lm(formula = Poverty ~ EH_2022, data = SEM_M1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.409 -14.642  -3.331  12.647  64.990 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.53889    0.57527   39.18   <2e-16 ***
## EH_2022      0.43674    0.02673   16.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.97 on 7957 degrees of freedom
##   (76 observations deleted due to missingness)
## Multiple R-squared:  0.03246,    Adjusted R-squared:  0.03234 
## F-statistic:   267 on 1 and 7957 DF,  p-value: < 2.2e-16
plot(Poverty ~ EH_2022, data = SEM_M1)

plot(allEffects(SEM_M2_LM5))

#Extreme Heat and Poverty are correlated
#Poverty increases with Extreme Heat 
SEM_M2_LM6 <- lm(Asthma ~ EH_2022, Poverty, Pollution, data = SEM_M1)
summary(SEM_M2_LM6)
## 
## Call:
## lm(formula = Asthma ~ EH_2022, data = SEM_M1, subset = Poverty, 
##     weights = Pollution)
## 
## Weighted Residuals:
##    Min     1Q Median     3Q    Max 
## -337.7 -120.9  -61.3  166.9  548.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.06467    1.00615   26.90   <2e-16 ***
## EH_2022      1.72985    0.04253   40.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 208.4 on 7958 degrees of freedom
##   (74 observations deleted due to missingness)
## Multiple R-squared:  0.1721, Adjusted R-squared:  0.172 
## F-statistic:  1654 on 1 and 7958 DF,  p-value: < 2.2e-16
plot(lm(Asthma ~ EH_2022, Poverty, Pollution, data = SEM_M1))

plot(allEffects(SEM_M2_LM6))

#Pollution, Poverty, and Extreme Heat are correlated with rate of Asthma 
#in California. The rate of Asthma increases with Pollution, Poverty, 
#and extreme heat
fit1c <- sem(SEM_M3, data=df2)
summary(fit1c)
## lavaan 0.6-12 ended normally after 1 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                          7959
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Poverty ~                                           
##     Pollution         0.538    0.015   36.006    0.000
##   Asthma ~                                            
##     Poverty           0.819    0.018   46.701    0.000
##     Pollution         0.057    0.025    2.262    0.024
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Poverty         287.074    4.551   63.083    0.000
##    .Asthma          702.136   11.130   63.083    0.000
#comparison of variance between two models
anova(fit1b, fit1c)
#ANOVA is a statistical test for estimating how a quantitative 
#dependent variable changes according to the levels of one or more independent 
#variables. ANOVA tests whether there is a difference in means of the 
#groups at each level of the independent variable.
#The null hypothesis (H0) of the ANOVA is no difference in means, 
#and the alternate hypothesis (Ha) is that the 
#means are different from one another.
#Reference: https://www.scribbr.com/statistics/anova-in-r/

#1) The baseline is a null model, typically in which all of your 
#observed variables are constrained to covary with no other variables 
#The covariances are fixed to 0)--just individual variances are estimated. 
#This is what is often taken as a 'reasonable' worst-possible fitting model, 
#against which your fitted model is compared in order to 
#calculate relative indexes of model fit (e.g., CFI/TLI).

#2) The chi-square statistic (labeled as the minimum function test statistic) 
#is used to perform a test of perfect model fit, both for your specified and 
#null/baseline models. It essentially is a measure of deviance between your 
#model-implied variance/covariance matrix, and your 
#observed variance/covariance matrix. 

#3) Standards for what level of model fit is considered "acceptable" 
#may differ from discipline to discipline, but at least according 
#to Hu & Bentler (1999), you are within the realm of what is considered 
#"acceptable". A CFI of .955 is often considered "good". Keep in mind, however, 
#that both TLI and CFI are relative indexes of model fit--they compare 
#the fit of your model to the fit of your (worst fitting) null model. 
#Hu & Bentler (1999) suggested that you interpret/report both a 
#relative and an absolute index of model fit. Absolute indexes of model fit 
#compare the fit of your model to a perfect fitting model--RMSEA and SRMR 
#are a couple of good candidates (the former is often calculated along with a 
#confidence interval, which is nice).

#References: 
#1. Brown, T. A. (2015). Confirmatory factor analysis for applied 
#research (2nd Edition). New York, NY: 

#2. Guilford Press., Hu, L., & Bentler, P. M. (1999). 
#Cutoff criteria for fit indexes in covariance structure analysis: 

#3. Conventional criteria versus new alternatives. Structural Equation Modeling, 
#6, 1-55. 

#4.  Kline, R. B. (2010). Principles and practice of structural equation modeling
#(3rd Edition). New York, NY: Guilford Press.

#5. Little, T. D. (2013). Longitudinal structural equation modeling. New York, 
#NY: Guilford Press.
#https://stats.stackexchange.com/questions/140909/how-do-i-
#interpret-lavaan-output

#The model summary first lists the independent variables being tested 
#(‘Pollution’ and ‘Poverty’). Next is the residual variance (‘Residuals’), 
#which is the variation in the dependent variable that isn’t explained by 
#the independent variables (predictor variables).

#The following columns provide all of the information needed to 
#interpret the model:
#Df shows the degrees of freedom for each variable 
#(number of levels in the variable minus 1).
#Sum sq is the sum of squares 
#(a.k.a. the variation between the group means created by the levels of 
#the independent variable and the overall mean).
#Mean sq shows the mean sum of squares (the sum of squares divided by 
#the degrees of freedom).
#F value is the test statistic from the F-test 
#(the mean square of the variable divided by the mean square of each parameter).
#Pr(>F) is the p-value of the F statistic, 
#and shows how likely it is that the F-value calculated 
#from the F-test would have occurred if the null hypothesis of 
#no difference was true.

#Measurement of Fit for Linear Regression Models #https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/ 
#The RMSE is the square root of the variance of the residuals. 
#It indicates the absolute fit of the model to the data–how close the 
#observed data points are to the model’s predicted values. 
#Whereas R-squared is a relative measure of fit, 
#RMSE is an absolute measure of fit. As the square root of a variance, 
#RMSE can be interpreted as the standard deviation of the unexplained variance. 
#It has the useful property of being in the same units as the response variable. 
#Lower values of RMSE indicate better fit. 
#RMSE is a good measure of how accurately the model predicts the response. 
#It’s the most important criterion for fit if the main purpose of the model 
#is prediction.


#https://www.scribbr.com/statistics/two-way-anova/
#fit statistics - Null/Baseline and User Model Fitting 
#Issue of the accuracy of path coefficients is the evaluation of 
#overall model fit.In historic path analysis, fit of the data to the 
#overall model was not assessed. Maximum likelihood methods as well as other 
#modern solution procedures provide for an additional assessment whether the 
#data is  consistent with the specified model overall (typically through a 
#chi square or related test of goodness of fit).Tests of overall model fit is 
#a major advance associated with modern path analysis

summary(fit1b, fit.measures=TRUE)
## lavaan 0.6-12 ended normally after 1 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         4
## 
##   Number of observations                          7959
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 5.113
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.024
## 
## Model Test Baseline Model:
## 
##   Test statistic                              3476.632
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.999
##   Tucker-Lewis Index (TLI)                       0.996
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -71194.304
##   Loglikelihood unrestricted model (H1)     -71191.747
##                                                       
##   Akaike (AIC)                              142396.608
##   Bayesian (BIC)                            142424.536
##   Sample-size adjusted Bayesian (BIC)       142411.825
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.023
##   90 Percent confidence interval - lower         0.007
##   90 Percent confidence interval - upper         0.044
##   P-value RMSEA <= 0.05                          0.986
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.008
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Poverty ~                                           
##     Pollution         0.538    0.015   36.006    0.000
##   Asthma ~                                            
##     Poverty           0.834    0.016   51.258    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Poverty         287.074    4.551   63.083    0.000
##    .Asthma          702.587   11.137   63.083    0.000
summary(fit1c, fit.measures=TRUE)
## lavaan 0.6-12 ended normally after 1 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                          7959
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                              3476.632
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -71191.747
##   Loglikelihood unrestricted model (H1)     -71191.747
##                                                       
##   Akaike (AIC)                              142393.495
##   Bayesian (BIC)                            142428.405
##   Sample-size adjusted Bayesian (BIC)       142412.516
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value RMSEA <= 0.05                             NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Poverty ~                                           
##     Pollution         0.538    0.015   36.006    0.000
##   Asthma ~                                            
##     Poverty           0.819    0.018   46.701    0.000
##     Pollution         0.057    0.025    2.262    0.024
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Poverty         287.074    4.551   63.083    0.000
##    .Asthma          702.136   11.130   63.083    0.000
inspect(fit1b)
## $lambda
##           Povrty Asthma Polltn
## Poverty        0      0      0
## Asthma         0      0      0
## Pollution      0      0      0
## 
## $theta
##           Povrty Asthma Polltn
## Poverty   0                   
## Asthma    0      0            
## Pollution 0      0      0     
## 
## $psi
##           Povrty Asthma Polltn
## Poverty   3                   
## Asthma    0      4            
## Pollution 0      0      0     
## 
## $beta
##           Povrty Asthma Polltn
## Poverty        0      0      1
## Asthma         2      0      0
## Pollution      0      0      0
inspect(fit1c)
## $lambda
##           Povrty Asthma Polltn
## Poverty        0      0      0
## Asthma         0      0      0
## Pollution      0      0      0
## 
## $theta
##           Povrty Asthma Polltn
## Poverty   0                   
## Asthma    0      0            
## Pollution 0      0      0     
## 
## $psi
##           Povrty Asthma Polltn
## Poverty   4                   
## Asthma    0      5            
## Pollution 0      0      0     
## 
## $beta
##           Povrty Asthma Polltn
## Poverty        0      0      1
## Asthma         2      0      3
## Pollution      0      0      0
#inspect() scans a data.frame object for errors that may affect the use of 
#functions in model. By default, all variables are checked regarding 
#the class (numeric or factor), missing values, and presence of 
#possible outliers. The function will return a warning if the data 
#looks like unbalanced, has missing values or possible outliers.
parameterEstimates(fit1b)
parameterEstimates(fit1c)
#Let us consider a simple path analysis model:
#Elements of graphical representation include:
#1. observed variables (in boxes)
#2. double-headed curved arrows (representing unresolved correlations)
#3. single-headed arrows (representing directed relationships residual variances

#Typically the symbols representing residual variances (error terms) 
#are not enclosed by squares. 

#There are several basic kinds of path models depending on their 
#structure and interconnections.

#Path coefficients specify values for the parameters associated with 
#pathways between variables.
#‘semPlot’ is a new package that can be used for unified visualizations of 
#SEM models.
#Path diagrams and visual analysis of various SEM packages' output. 
#Path diagrams including visualizations of the parameter estimates 
#can be plotted with semPaths and visualizations of the implied and 
#observed correlation structures
library(semPlot)
p_1b <- semPaths(fit1b, whatLabels = "est", #Baseline Model
         #  object, rsquare = TRUE
           sizeMan = 10,
           edge.label.cex = 1.15,
           style = "ram",
           nCharNodes = 0, nCharEdges = 0)

p_1c <- semPaths(fit1c, whatLabels = "est", #User Model
           sizeMan = 10,
           edge.label.cex = 1.15,
           style = "ram",
           nCharNodes = 0, nCharEdges = 0)

#load package ("semptools")
#plot significant paths based on p-value
#We know from the lavaan::lavaan() output that some paths are significant 
#and some are not. In some disciplines, asterisks are conventionally added 
#indicate this. However, semPlot::semPaths() does not do this. 
#We can use mark_sig() to add asterisks based on the p-values 
#of the free parameters.
#Reference 
#https://cran.r-project.org/web/packages/semptools/vignettes/semptools.html
library(semptools)
p_1b_1 <- mark_sig(p_1b, fit1b)
plot(p_1b_1)

p_1c_1 <- mark_sig(p_1c, fit1c)
plot(p_1c_1)

#Advantages of SEM:
#1. Can test hypotheses based on multiple constructs that may be 
#indirectly or directly related for both linear and nonlinear models. 
#2. Compared to conventional multiple regression analyses, 
#SEM has greater statistical power

#Disadvanatages of SEM:
#1. Simultaneous examination of multiple variables requires larger sample sizes 
#for additional variables
#2. SEM cannot correct for weaknesses inherent in any type of study. 
#3. An a priori specification is necessary for making sense of the 
#statistical significance of relationships among variables

#Like any model, there are some advantages and disadvantages to SEM.

#1. The advantages of modeling using SEM compared to other techniques 
#boils down to two points. Firstly, with SEM you can test hypothesis 
#based on multiple constructs that may be indirectly or directly related for 
#both linear and nonlinear models. Secondly, compared to conventional 
#multiple regression analyses, SEM has greater statistical power. Regarding the 
#disadvantages of SEM, they mostly concern data structure and the quality of 
#your experimental design. The first disadvantage is that simultaneous 
#examination of multiple variables requires larger sample sizes for additional variables.

#SEM Capabilities: 
#SEM techniques can be applied to problems that range from strictly confirmatory 
#to highly exploratory. Other methods, such as principal components analysis (PCA), 
#do not have procedures for directly evaluating a-priori ideas about the 
#precise nature of underlying factors that explain a set of correlations.

#SEM Challenges:
#SEM in lavaan is for the most part pretty intuitive and user friendly, 
#however, we wanted to take some time to summarize potential challenges 
#you might encounter. If you’re going to encounter an error in your modeling, 
#it’s most likely going to be due to something with either your data structure 
#or the syntax that you’re using. Make sure you’ve taken the necessary steps 
#to remove NAs and make your data types for your selected variables are uniform.
#1. Data structure
#2. Syntax structure
#3. Lavaan vs SEM package comparison
#4. Data interpretation from outputs

#Results:
#1. Null/Baseline Model has a p-value of 0.0000 and R-squared value of 0.000.
#2. User Model has a p-value of 0.024 and R-squared value of 0.237.
#User Model rejected the null hypothesis (p > 0.000)
#The global p-value needs to be above p > 0.05 to be a good fitting model.

#“Does the hypothesized model fit the data well?” This is a critical question 
#in almost every application of structural equation modeling (SEM). 
#The model chi-square statistic and several fit indices are commonly reported 
#to address this question. Several model fit indices that are widely applied are 
#considered, all of which are based on a fit function given a specific estimation method. 

#The chi-square statistic (labeled as the minimum function test statistic) 
#is used to perform a test of perfect model fit, both for your specified and 
#null/baseline models. It essentially is a measure of deviance between your 
#model-implied variance/covariance matrix, and your observed 
#variance/covariance matrix. Our User Model rejected the null by having a (p < .001). 
#The global p-value for our user model is 0.024. 
#A perfect model or good quality model should have a p-value of (p> 0.05). 
#Though, the global p-value of our baseline model was 0.000. 
#Some statisticians (e.g., Klein, 2010) argue that the chi-square test of model 
#fit is useful in evaluating the quality of a model, but most others 
#discourage putting a lot stock in its interpretation, both for conceptual 
#(i.e., the null of perfect fit is unreasonable) and practical (i.e, chi-square test 
#is sensitive to sample size). RMSEA is an absolute fit index, in that it 
#assesses how far a hypothesized model is from a perfect model. 

#Fit indices that are considered include the root mean square error of 
#approximation (RMSEA; Steiger, 1990; Steiger & Lind, 1980). 
#Also, the comparative fit index (CFI; Bentler, 1990), and Tucker–Lewis index 
#(TLI; Bentler & Bonett, 1980; Tucker & Lewis, 1973) compares the fit of a 
#target model to the fit of an independent, or null, model. CFI and TLI larger 
#than .95 indicate relatively good model–data fit in general. CFI and TLI are 
#incremental fit indices that compare the fit of a hypothesized model (User Model) 
#with that of a baseline model (i.e., a model with the worst fit). 
#RMSEA value of < .05 indicates a “close fit,” and 
#that < .08 suggests a reasonable model–data fit. The RMSEA of our User Model 
#came to be 0.023, while the RMSEA of the baseline was 0.000.
#The CFI and TFI of our User Model came to be 0.999 and 0.996 respectively, 
#which is good because it should be > .90. The CFI and TFI for our 
#baseline model were 1.000 and 1.000.
#The application of RMSEA, CFI, and TLI is heavily contingent on a set of cutoff criteria. 

#Discussion Questions
#1.Based on the outputs of this SEM, do we have enough information to 
#justify this particular path?
#2. One potential flaw could be your ecological assumption or 
#missing variables that are not being considered? 
#(There could be limitations based on the data collection from sources) 
#3. What would best the approach to improve the quality of our SEM linear models?

#Future Research: Evaluate Extreme Heat Impacts (4th Variable)
#Research Questions - 
#1.Does Pollution impact Asthma?
#2.Does Pollution and Poverty impact Asthma?
#3.Does Pollution, Poverty, and Extreme Heat impact asthma?

#Null/Baseline Model - 
#1. Pollution Burden, Poverty, and Extreme Heat have a direct effect 
#on the rate of Asthma in California.

#User Model - 
#1. Pollution Burden, Poverty, and Extreme Heat have a direct effect on the 
#rate of Asthma in California. 
#2. Pollution and Extreme Heat have a direct effect on Poverty.
#3. Pollution Burden and Extreme Heat have a direct and 
#indirect effect on the rate of Asthma in California mediated by Poverty.